Abstract

We derive theoretical upper and lower bounds on the maximum size of DNA 
codes of length n with constant GC-content w and minimum Hamming distance 
d, both with and without the additional constraint that the minimum Hamming 
distance between any codeword and the reverse-complement of any codeword be at 
least d. We also explicitly construct codes that are larger than the best previously- 
published codes for many choices of the parameters n, d and w. 

Introduction 

Libraries of DNA words satisfying certain combinatorial constraints have applications to 
DNA barcoding and DNA computing (see e.g. and the references therein). The goal 
is to design libraries that are as large as possible given the constraints.

We first review some terminology and notation; see [TfH ITT] for more context. Let $Z_q$
denote the q-character alphabet $\{0, \ldots, q-1\}$. By a q-ary word of length n we mean an
element x of $Z_q^n$, which we write as $x = x_1 \cdots x_n$. A q-ary code of length n is just a subset
of $Z_q^n$, and the elements of the code are called codewords. The Hamming distance $H(x, y)$
between two q-ary words x and y of length n is defined to be the number of coordinates in
which they differ, and the Hamming weight of x is the number of coordinates in which it
is nonzero. The maximum cardinality of a q-ary code of length n for which the minimum
Hamming distance between two distinct codewords is at least d is denoted $A_q(n, d)$. If we
also require each codeword to have Hamming weight w (i.e., that the code be a
constant-weight code), the maximum cardinality is denoted $A_q(n, d, w)$.

A DNA code is a q-ary code with q = 4; we identify the elements $0, 1, 2, 3 \in Z_4$
with the nucleotides A, C, G, T (in that order). The reverse complement of a DNA
word $x = x_1 \cdots x_n$ is denoted by $x^{RC}$, and is defined to be the word
$\bar{x}_n \bar{x}_{n-1} \cdots \bar{x}_1$, where $\bar{x}_i$
is the Watson-Crick complement of $x_i$ (i.e., $\bar{A} = T$, $\bar{T} = A$, $\bar{C} = G$ and $\bar{G} = C$). By
requiring the minimum Hamming distance between two DNA codewords to be sufficiently
large, one can make it unlikely that a codeword hybridizes to the reverse-complement
of any other codeword. By requiring the minimum Hamming distance between a DNA
codeword and the reverse-complement of a DNA codeword to be sufficiently large, one
can make it unlikely that a codeword hybridizes to any other codeword or to itself JSj. We
denote by $A_4^{RC}(n, d)$ the maximum size of a DNA code of length n in which $H(x, y) \ge d$
for all distinct codewords x and y and $H(x, y^{RC}) \ge d$ for all (not necessarily distinct)
codewords x and y. If we also require each codeword to have Hamming weight w, the
maximum cardinality is denoted $A_4^{RC}(n, d, w)$.

The GC-content of a DNA word is defined to be the number of positions in which
the word has coordinate C or G. It may be desirable that all codewords in a DNA
code have roughly the same GC-content, so that they have similar melting temperatures
(see e.g. [9]). The quantities $A_4^{GC}(n, d, w)$ and $A_4^{GC,RC}(n, d, w)$ are defined analogously
to $A_4(n, d, w)$ and $A_4^{RC}(n, d, w)$, except that in the former two cases it is the GC-content
(rather than the Hamming weight) of each codeword that is required to be w.
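The definitions above translate directly into code. The following Python sketch is purely illustrative: the function names are ours (they do not appear in the paper), and DNA words are assumed to be plain strings over the letters A, C, G, T.

```python
# Illustrative helpers; all names here are hypothetical, not taken from the paper.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hamming(x, y):
    """Number of coordinates in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def reverse_complement(x):
    """Reverse the word and replace each nucleotide by its Watson-Crick complement."""
    return "".join(COMPLEMENT[c] for c in reversed(x))

def gc_content(x):
    """Number of positions holding C or G."""
    return sum(c in "CG" for c in x)

def is_valid_code(code, d, w, use_rc=True):
    """Check constant GC-content w, pairwise distance >= d, and (optionally)
    H(x, y^RC) >= d for all codewords x, y, including x = y."""
    if any(gc_content(x) != w for x in code):
        return False
    for i, x in enumerate(code):
        for j, y in enumerate(code):
            if i < j and hamming(x, y) < d:
                return False
            if use_rc and hamming(x, reverse_complement(y)) < d:
                return False
    return True
```

For instance, `is_valid_code(["AACC", "CCAA"], d=4, w=2)` checks the two-word code of length 4 in which every required distance equals the full length.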

Theoretical upper and lower bounds on $A_4^{RC}(n, d, w)$, with no restriction on GC-content,
are given in J7j. Explicit constructions using stochastic local search [211121] and
a "template-map" strategy [H] provide lower bounds on $A_4^{GC}(n, d, w)$ and $A_4^{GC,RC}(n, d, w)$
for a limited range of parameters n, d and w. In this paper we derive theoretical upper
and lower bounds on $A_4^{GC}(n, d, w)$ and $A_4^{GC,RC}(n, d, w)$ for all parameters, and we use
lexicographic constructions to find explicit codes that improve on many of the lower bounds
in [HI23I211.



Upper bounds 

Before giving upper bounds on the sizes of DNA codes with constant GC-content, we note 
some simple special cases: 

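Proposition 1 For n > 0, with $0 \le d \le n$ and $0 \le w \le n$,

\begin{align}
A_4^{GC}(n,d,0) &= A_2(n,d) \tag{1}\\
A_4^{GC}(n,d,w) &= A_4^{GC}(n,d,n-w) \tag{2}\\
A_4^{GC}(n,n,w) &= \begin{cases} 4 & \text{if } w = n/2 \\ 3 & \text{if } n/3 \le w < n/2 \text{ or } n/2 < w \le 2n/3 \\ 2 & \text{if } w < n/3 \text{ or } w > 2n/3 \end{cases} \tag{3}\\
A_4^{GC,RC}(n,n,w) &= \begin{cases} 2 & \text{if } w = n/2 \\ 1 & \text{if } w \ne n/2 \end{cases} \tag{4}\\
A_4^{GC}(n,1,w) &= \binom{n}{w} 2^n \tag{5}\\
A_4^{GC,RC}(n,1,w) &= \begin{cases} \frac{1}{2}\left(\binom{n}{w} 2^n - \binom{n/2}{w/2} 2^{n/2}\right) & \text{if } n \text{ is even and } w \text{ is even} \\ \frac{1}{2}\binom{n}{w} 2^n & \text{if } n \text{ is odd or } w \text{ is odd.} \end{cases} \tag{6}
\end{align}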



Proof. (1): Changing all 0's in a binary code to A's and all 1's to T's gives a
Hamming-distance-preserving bijection between the set of all binary codes of length n and the set
of all DNA codes of length n with constant GC-content 0.

(2): Interchange A's with C's, and T's with G's.

(3): By (2) we may assume $w \le n/2$. If no two codewords agree in any position, then
there can be at most four codewords by the pigeonhole principle. Hence $A_4^{GC}(n,n,w) \le 4$
for all w. If there are four codewords none of which agree in any position, then each of the
four nucleotides must occur exactly once in each of the n positions, so the average
GC-content of the four words is exactly n/2. This implies that $A_4^{GC}(n,n,w) \le 3$ for $w < n/2$,
since in a code with constant GC-content w, the average GC-content is w. If three words
each have GC-content $w < n/3$, then there is some position j in which none of the words
has a C or G, and at least two of the three words must agree in this position (both A
or both T). Hence $A_4^{GC}(n,n,w) \le 2$ if $w < n/3$. The following constructions demonstrate
the reverse inequalities: for $w = n/2$, the four words $A^w C^w$, $C^w A^w$, $T^w G^w$ and $G^w T^w$
have pairwise distance n; for $n/3 \le w < n/2$ the three words $C^w A^{n-w}$, $T^{n-w} C^w$ and
$A^{\lfloor (n-w)/2 \rfloor} G^w T^{\lceil (n-w)/2 \rceil}$ have pairwise distance n; for $w < n/3$ the two words
$C^w A^{n-w}$ and $G^w T^{n-w}$ are distance n apart.

(4): For $w = n/2$, the two words $A^w C^w$ and $C^w A^w$ satisfy the distance and
reverse-complement constraints. For $w \ne n/2$, the word $C^w A^{n-w}$ satisfies the constraints. These
are the largest sets possible, by (3) together with Theorem 7.

(5): This is the total number of DNA words of length n and GC-content w.

(6): When n and w are even, there are $\binom{n/2}{w/2} 2^{n/2}$ words with GC-content w that are
their own reverse complements; otherwise there are none.
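The explicit words in the proof of (3) and the count used for (6) are easy to check by machine for small parameters. The sketch below is only a sanity check under our own naming: `distance_n_code` builds the words described in the proof for $w \le n/2$, and `count_self_reverse_complementary` enumerates all words by brute force, so it is practical only for small n.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def distance_n_code(n, w):
    """Words from the proof of (3); assumes 0 <= w <= n/2 (use (2) for larger w)."""
    if 2 * w > n:
        raise ValueError("apply symmetry (2) first so that w <= n/2")
    if 2 * w == n:
        return ["A" * w + "C" * w, "C" * w + "A" * w,
                "T" * w + "G" * w, "G" * w + "T" * w]
    if 3 * w >= n:
        lo, hi = (n - w) // 2, (n - w + 1) // 2   # floor and ceiling of (n-w)/2
        return ["C" * w + "A" * (n - w),
                "T" * (n - w) + "C" * w,
                "A" * lo + "G" * w + "T" * hi]
    return ["C" * w + "A" * (n - w), "G" * w + "T" * (n - w)]

def count_self_reverse_complementary(n, w):
    """Brute-force count of words with GC-content w equal to their own reverse complement."""
    count = 0
    for x in product("ACGT", repeat=n):
        if sum(c in "CG" for c in x) != w:
            continue
        if all(x[i] == COMPLEMENT[x[n - 1 - i]] for i in range(n)):
            count += 1
    return count
```

For n = 5 and w = 2 the construction returns CCAAA, TTTCC and AGGTT, which differ pairwise in all five positions.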



Johnson-type bounds 

A code of length n can be shortened to a (usually smaller) code of length n - 1 without
decreasing the minimum Hamming distance, by choosing any character $b \in Z_q$ and any
position $i \in \{1, \ldots, n\}$, keeping just those codewords that have b in their i-th position,
and then deleting the i-th position from these codewords [TH]. This procedure is used in
proving the following bounds.



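Theorem 2 For $0 \le d \le n$ and $0 < w < n$,

\begin{align}
A_4^{GC}(n,d,w) &\le \left\lfloor \frac{2n}{w}\, A_4^{GC}(n-1,d,w-1) \right\rfloor \tag{7}\\
A_4^{GC}(n,d,w) &\le \left\lfloor \frac{2n}{n-w}\, A_4^{GC}(n-1,d,w) \right\rfloor. \tag{8}
\end{align}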

Proof. (7): In any set of M words with length n, minimum Hamming distance at least
d and constant GC-content w, there is some position i in which at least $\lceil wM/2n \rceil$
codewords have nucleotide C, or some position i in which at least $\lceil wM/2n \rceil$ codewords have
nucleotide G; otherwise, the average GC-content would be less than w. Keeping just
these codewords, and deleting position i, gives a code with length n - 1, GC-content w - 1,
and minimum Hamming distance at least d. Inequality (8) is analogous, based on the
observation that there is some position with at least $\lceil (n-w)M/2n \rceil$ A's or
$\lceil (n-w)M/2n \rceil$ T's.
Remark 3 Upper bounds on $A_4^{GC}(n, d, w)$ are obtained by repeatedly applying
inequalities (7) and (8), in any order, until n = d, n = w or w = 0, at which point (1)-(3) may
be used. (Different orders of applying (7) and (8) may result in different bounds.) One
may continue using (8) even after w = 0 (or (7) even after n = w), until n = d, but this
amounts to upper-bounding $A_4^{GC}(n, d, 0) = A_2(n, d)$ with the Singleton bound, $2^{n-d+1}$
(see e.g. [3]). Tighter upper bounds for $A_2(n, d)$ are known for many n and d; see for
example [To].
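The recursion described in Remark 3 can be sketched as follows. This is a minimal illustration rather than the procedure actually used for the tables: the function names are ours, the minimum over both inequalities is taken at every step, and, as an explicit simplifying assumption, $A_2(n, d)$ is replaced by the Singleton bound $2^{n-d+1}$ whenever w reaches 0 or n (tabulated values of $A_2(n, d)$ would give tighter results).

```python
from functools import lru_cache

def exact_when_d_equals_n(n, w):
    """Proposition 1, equation (3): the exact value of A_4^GC(n, n, w)."""
    if 2 * w == n:
        return 4
    if n <= 3 * w <= 2 * n:
        return 3
    return 2

@lru_cache(maxsize=None)
def johnson_upper(n, d, w):
    """Upper bound on A_4^GC(n, d, w) from repeated use of (7) and (8)."""
    if not (1 <= d <= n and 0 <= w <= n):
        raise ValueError("expected 1 <= d <= n and 0 <= w <= n")
    if d == n:
        return exact_when_d_equals_n(n, w)
    if w == 0 or w == n:
        # By (1) and (2) this is A_2(n, d); ASSUMPTION: use the Singleton bound.
        return 2 ** (n - d + 1)
    via_7 = (2 * n * johnson_upper(n - 1, d, w - 1)) // w
    via_8 = (2 * n * johnson_upper(n - 1, d, w)) // (n - w)
    return min(via_7, via_8)
```

Flooring at each step is legitimate because every intermediate quantity bounds an integer. With these choices, `johnson_upper(4, 2, 2)` evaluates to 48.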

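Theorem 4 Suppose there is a set of M words of length n, constant GC-content w, and
minimum Hamming distance at least d. Write $wM = nk + r$ with $0 \le r < n$. Then

\begin{align}
M(M-1)d \le{} & (n-r)\left(M^2 - \left\lfloor \tfrac{k}{2} \right\rfloor^2 - \left\lceil \tfrac{k}{2} \right\rceil^2 - \left\lfloor \tfrac{M-k}{2} \right\rfloor^2 - \left\lceil \tfrac{M-k}{2} \right\rceil^2\right) \\
& + r\left(M^2 - \left\lfloor \tfrac{k+1}{2} \right\rfloor^2 - \left\lceil \tfrac{k+1}{2} \right\rceil^2 - \left\lfloor \tfrac{M-k-1}{2} \right\rfloor^2 - \left\lceil \tfrac{M-k-1}{2} \right\rceil^2\right). \tag{9}
\end{align}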



Proof. Let $a_i$, $c_i$, $g_i$ and $t_i$ denote the number of occurrences of A, C, G and T (respectively)
in the i-th position of the M codewords. Note that $\sum_{i=1}^{n} (c_i + g_i) = wM$. The sum of the
Hamming distances over all $M^2$ ordered pairs of codewords is
$D = \sum_{i=1}^{n} (M^2 - a_i^2 - c_i^2 - g_i^2 - t_i^2)$. Subject only to the constraints that
$a_i + c_i + g_i + t_i = M$ for each i and that $\sum_{i=1}^{n} (c_i + g_i) = wM$, the expression D is
maximized when $c_i + g_i$ is as close as possible to $wM/n$ for each i, when $a_i$ is as close as
possible to $t_i$ for each i, and when $c_i$ is as close as possible to $g_i$ for each i. This is also
true when $a_i$, $c_i$, $g_i$ and $t_i$ are constrained to be
integers, as can be proved using the same type of argument as in [T^], for example. Hence
the right-hand side of (9) is an upper bound for the sum of the $M^2$ pairwise Hamming
distances. For the left-hand side, note that since the Hamming distance between distinct
codewords is at least d, the sum of the Hamming distances taken over all $M^2$ ordered
pairs of codewords is at least $M(M-1)d$.
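Theorem 4 can be turned into a numerical upper bound by searching for the first M at which the inequality fails. The sketch below is one way to do this and uses our own function names; it relies on the observation that any subset of a code is again a code, so if a code of size M existed the inequality would have to hold for every smaller size as well.

```python
from math import comb

def theorem4_rhs(n, w, M):
    """Right-hand side of the Theorem 4 inequality for a hypothetical code of size M."""
    k, r = divmod(w * M, n)

    def column_max(gc):
        # Largest value of M^2 - a^2 - c^2 - g^2 - t^2 with c + g = gc and a + t = M - gc,
        # obtained by splitting each sum as evenly as possible.
        at = M - gc
        return (M * M - (gc // 2) ** 2 - ((gc + 1) // 2) ** 2
                - (at // 2) ** 2 - ((at + 1) // 2) ** 2)

    total = (n - r) * column_max(k)
    if r:
        total += r * column_max(k + 1)
    return total

def theorem4_upper(n, d, w):
    """Largest M such that the inequality holds for every size up to M."""
    all_words = comb(n, w) * 2 ** n      # trivial cap: number of words with GC-content w
    M = 1
    while M < all_words and (M + 1) * M * d <= theorem4_rhs(n, w, M + 1):
        M += 1
    return M
```

For n = 8, d = 7, w = 4 the inequality holds up to M = 5 (140 <= 144) and fails at M = 6 (210 > 208), so the bound is 5.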

If we relax the constraint that the counts $a_i$, $c_i$, $g_i$ and $t_i$ be integers, Theorem 4
simplifies to the following:

Theorem 5 If $2dn > w^2 + 4w(n-w) + (n-w)^2$, then

\[ A_4^{GC}(n,d,w) \le \frac{2dn}{2dn - (w^2 + 4w(n-w) + (n-w)^2)}. \tag{10} \]
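As a worked illustration of Theorem 5, the closed form can be evaluated exactly with rational arithmetic. The sketch assumes the bound has numerator 2dn, which is what the relaxation of Theorem 4 yields when $c_i = g_i = wM/2n$ and $a_i = t_i = (n-w)M/2n$ in every position; the function name is ours.

```python
from fractions import Fraction
from math import floor

def theorem5_upper(n, d, w):
    """Relaxed Johnson-type bound; returns None when the hypothesis of Theorem 5 fails."""
    s = w * w + 4 * w * (n - w) + (n - w) ** 2
    if 2 * d * n <= s:
        return None                      # bound not applicable
    # The code size is an integer, so the rational bound may be rounded down.
    return floor(Fraction(2 * d * n, 2 * d * n - s))
```

For example, with n = 10, d = 8 and w = 5 we get s = 150 and 2dn = 160, so the bound is 160/10 = 16. For n = 8, d = 7, w = 4 the relaxed bound is 112/16 = 7, which is weaker than what the integer version in Theorem 4 gives for the same parameters.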

Remark 6 Versions of the bounds in Theorems 2, 4 and 5 for binary constant-weight
codes IT2"] are called Johnson bounds. Johnson bounds have been generalized to q-ary
constant-weight codes [213 E| and to q-ary constant-composition codes (where the number
of occurrences of each character in each codeword is prescribed) [22]. They can also
be generalized to a setting in which the q characters $\{0, \ldots, q-1\}$ are partitioned into
any number of subsets, with the total number of occurrences from each subset specified.
Constant-weight codes correspond to the partition $\{0, \ldots, q-1\} = \{0\} \cup \{1, \ldots, q-1\}$,
and constant-composition codes to the partition $\{0, \ldots, q-1\} = \{0\} \cup \cdots \cup \{q-1\}$. Our
bounds for DNA codes with constant GC-content correspond to the partition
$\{0, 1, 2, 3\} = \{0, 3\} \cup \{1, 2\}$.
Halving bound

Any upper bound for $A_4^{GC}(n, d, w)$ yields an upper bound for $A_4^{GC,RC}(n, d, w)$ by the
following result, an analogue of the halving bound for DNA codes with unrestricted
GC-content in ^7j. The same proof works here, since the reverse-complement of a DNA word
has the same GC-content as the word itself.

Theorem 7 For $0 < d \le n$ and $0 \le w \le n$,

\[ A_4^{GC,RC}(n,d,w) \le \tfrac{1}{2} A_4^{GC}(n,d,w). \tag{11} \]

Proof. If $\{x_i\}_{i=1}^{M}$ is a set of M codewords with constant GC-content w, minimum Hamming
distance at least d, and with $H(x_i, x_j^{RC}) \ge d$ for all $1 \le i, j \le M$, then
$\{x_i\}_{i=1}^{M} \cup \{x_i^{RC}\}_{i=1}^{M}$
is a set of words with constant GC-content w and minimum Hamming distance at least d.
This set has cardinality 2M provided that $\{x_i\}_{i=1}^{M} \cap \{x_i^{RC}\}_{i=1}^{M} = \emptyset$, which holds for d > 0.



Lower bounds 
Gilbert-type bounds 

If C is a set of words in $Z_q^n$ with the property that the Hamming distance between any pair
of words in C is at least d, and if C is maximal in the sense that no more points from $Z_q^n$
can be added to C without violating this distance constraint, then the balls of Hamming
radius d - 1 around the points in C cover all of $Z_q^n$. This is the idea behind the Gilbert
bound for q-ary codes (see e.g. (201)), and a similar argument applies to constant-weight
codes (see e.g. [3]). Here we give an analogue for DNA codes with constant GC-content:

Theorem 8 For $0 < d \le n$ and $0 \le w \le n$,

\[ A_4^{GC}(n,d,w) \ge \frac{\binom{n}{w} 2^n}{\sum_{r=0}^{d-1} \sum_{i=0}^{\min\{\lfloor r/2 \rfloor,\, w,\, n-w\}} \binom{w}{i} \binom{n-w}{i} \binom{n-2i}{r-2i}\, 2^{2i}}. \tag{12} \]

Proof. The numerator gives the total number of words with GC-content w. The
denominator gives the number of these words that have distance at most d - 1 from any fixed
codeword x. (In the denominator, $\binom{w}{i} \binom{n-w}{i} \binom{n-2i}{r-2i} 2^{2i}$ is the number of words y with
GC-content w for which $H(x, y)$ is exactly r, and for which there are exactly w - i positions
j with $x_j$ and $y_j$ both in {C, G}.)

Remark 9 Replacing d - 1 with $\lfloor (d-1)/2 \rfloor$ as the upper index of the outer summation in
the denominator of (12) gives an upper bound for $A_4^{GC}(n, d, w)$, since the balls of Hamming
radius $\lfloor (d-1)/2 \rfloor$ centered around codewords must be disjoint. This is an analogue of
the sphere-packing bound for q-ary codes; see e.g. [271j.
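Both the Gilbert-type bound of Theorem 8 and the sphere-packing variant of Remark 9 reduce to computing the size of a Hamming ball restricted to words of GC-content w. A minimal sketch, with function names of our own choosing:

```python
from math import comb

def ball_size(n, w, radius):
    """Number of words with GC-content w within the given Hamming distance of a
    fixed word that also has GC-content w."""
    total = 0
    for r in range(radius + 1):
        for i in range(min(r // 2, w, n - w) + 1):
            # i positions switch from {C,G} to {A,T} and i positions switch back;
            # the remaining r - 2i changes stay inside the same letter class.
            total += comb(w, i) * comb(n - w, i) * comb(n - 2 * i, r - 2 * i) * 4 ** i
    return total

def gilbert_lower(n, d, w):
    """Lower bound from Theorem 8, rounded up since code sizes are integers."""
    words = comb(n, w) * 2 ** n
    return -(-words // ball_size(n, w, d - 1))

def sphere_packing_upper(n, d, w):
    """Upper bound from Remark 9, rounded down."""
    words = comb(n, w) * 2 ** n
    return words // ball_size(n, w, (d - 1) // 2)
```

A useful consistency check is that a ball of radius n contains every word of GC-content w, i.e. `ball_size(n, w, n) == comb(n, w) * 2**n`. For n = 4, d = 2, w = 2 the ball of radius 1 has 5 words, so the Gilbert-type bound is the ceiling of 96/5, namely 20.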
Now define $V(n, w, d) = \#\{x \in Z_4^n : x \text{ has GC-content } w \text{ and } H(x, x^{RC}) = d\}$. Note
that since no nucleotide is its own complement, $V(n, w, d) = 0$ unless n and d have the
same parity (i.e., are both even or are both odd).

Lemma 10 For n = 2m and d = 2e even,

\[ V(2m, w, 2e) = \sum_{i=\max\{0,\, w-m,\, \lceil (w-e)/2 \rceil\}}^{\lfloor w/2 \rfloor} \binom{m}{i} \binom{m-i}{w-2i} \binom{m-w+2i}{e-w+2i}\, 2^{m+2(w-2i)}. \tag{13} \]

For n = 2m + 1 and d = 2e + 1 odd,

\[ V(2m+1, w, 2e+1) = 2V(2m, w, 2e) + 2V(2m, w-1, 2e). \tag{14} \]

Proof. In (13), the index i ranges over the number of positions $j \le m$ for which both
$x_j$ and $x_{2m-j+1}$ belong to {C, G}. There are $\binom{m}{i}$ ways to select these positions, and
$\binom{m-i}{w-2i} 2^{w-2i}$ ways to select the positions for the remaining w - 2i occurrences of C's or
G's. There are then m - w + i positions $j \le m$ for which both $x_j$ and $x_{2m-j+1}$ belong
to {A, T}. Note that the j-th coordinate of x necessarily differs from the j-th coordinate
of $x^{RC}$ in the w - 2i positions $j \le m$ for which one of $x_j$ and $x_{2m-j+1}$ is in {A, T} and
the other is in {C, G}, so there are $\binom{m-w+2i}{e-w+2i}$ ways to choose the remaining e - w + 2i
positions $j \le m$ in which $x_j$ differs from the complement of $x_{2m+1-j}$. After all these
choices have been made, there are two choices for the nucleotide in each position $j \le m$;
for the m - w + 2i positions $j \le m$ for which $x_j$ and $x_{2m-j+1}$ both belong to {C, G} or
both belong to {A, T}, the nucleotide at $x_{2m-j+1}$ is forced by the choice of $x_j$; for the
other w - 2i positions $j \le m$, there are two choices for the nucleotide $x_{2m-j+1}$.

In (14), the first summand gives the number of words with $x_{m+1} \in \{A, T\}$ and the
second summand gives the number of words with $x_{m+1} \in \{C, G\}$; in each case there are
two choices for the middle nucleotide $x_{m+1}$, which always differs from its own complement.
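The counting argument in the proof of Lemma 10 can be checked against direct enumeration. In the sketch below (names are ours), `v_even` multiplies together the choices listed in the proof: the i pairs with both entries in {C, G}, the w - 2i mixed pairs and their orientation, the additional differing pairs, and the nucleotide choices. For odd length each term is doubled to account for the two possible middle nucleotides within a letter class; this is our reading of the argument, and the brute-force function is there so that the reading can be tested for small n.

```python
from itertools import product
from math import comb

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def v_even(m, w, e):
    """Count words of length 2m, GC-content w, at distance 2e from their reverse complement."""
    if w < 0:
        return 0
    total = 0
    lower = max(0, w - m, -((e - w) // 2))        # last term is ceil((w - e) / 2)
    for i in range(lower, w // 2 + 1):
        total += (comb(m, i) * comb(m - i, w - 2 * i)
                  * comb(m - w + 2 * i, e - w + 2 * i)
                  * 2 ** (m + 2 * (w - 2 * i)))
    return total

def v(n, w, d):
    """V(n, w, d) for any length n, via the even-length count."""
    if not (0 <= w <= n and 0 <= d <= n) or (n - d) % 2:
        return 0
    m, e = n // 2, d // 2
    if n % 2 == 0:
        return v_even(m, w, e)
    return 2 * v_even(m, w, e) + 2 * v_even(m, w - 1, e)

def v_brute(n, w, d):
    """Direct enumeration over all 4^n words; only practical for small n."""
    count = 0
    for x in product("ACGT", repeat=n):
        if sum(c in "CG" for c in x) != w:
            continue
        rc = [COMPLEMENT[c] for c in reversed(x)]
        if sum(a != b for a, b in zip(x, rc)) == d:
            count += 1
    return count
```

For n = 4 and w = 2 the three nonzero values are V(4, 2, 0) = 8, V(4, 2, 2) = 16 and V(4, 2, 4) = 72, which sum to the 96 words of GC-content 2.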



Theorem 11 For $0 < d \le n$ and $0 \le w \le n$,

\[ A_4^{GC,RC}(n,d,w) \ge \frac{\sum_{r=d}^{n} V(n,w,r)}{2 \sum_{r=0}^{d-1} \sum_{i=0}^{\min\{\lfloor r/2 \rfloor,\, w,\, n-w\}} \binom{w}{i} \binom{n-w}{i} \binom{n-2i}{r-2i}\, 2^{2i}}. \tag{15} \]

Proof. The numerator gives the total number of words with GC-content w that have
distance at least d from their reverse-complements, and the denominator gives an upper
bound on the number of these words that have distance at most d - 1 from any fixed
codeword or from its reverse-complement. (The denominator is an upper bound rather than
an exact count, because the
balls of radius d - 1 around a word and its reverse-complement might overlap, and because
when counting the number of words in these balls we may be including some words y that
do not satisfy the condition $H(y, y^{RC}) \ge d$.)
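For small n the bound of Theorem 11 can be evaluated without the closed form for V, by enumerating the admissible words directly. The sketch below is an illustration under two stated assumptions: the denominator is taken to be twice the ball size (one ball around the codeword and one around its reverse-complement, as the proof describes), and the numerator is computed by brute force, which limits it to short words. Function names are ours.

```python
from itertools import product
from math import comb

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def admissible_words(n, d, w):
    """Number of words with GC-content w at distance >= d from their reverse complement."""
    count = 0
    for x in product("ACGT", repeat=n):
        if sum(c in "CG" for c in x) != w:
            continue
        rc = [COMPLEMENT[c] for c in reversed(x)]
        if sum(a != b for a, b in zip(x, rc)) >= d:
            count += 1
    return count

def ball_size(n, w, radius):
    """Words of GC-content w within the given distance of a fixed word of GC-content w."""
    return sum(comb(w, i) * comb(n - w, i) * comb(n - 2 * i, r - 2 * i) * 4 ** i
               for r in range(radius + 1)
               for i in range(min(r // 2, w, n - w) + 1))

def gilbert_rc_lower(n, d, w):
    """Gilbert-type lower bound with the reverse-complement constraint, rounded up."""
    return -(-admissible_words(n, d, w) // (2 * ball_size(n, w, d - 1)))
```

For n = 4, d = 2, w = 2 there are 96 - 8 = 88 admissible words and the ball of radius 1 has 5 words, so the bound is the ceiling of 88/10, namely 9.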
Lexicographic codes

See [0] for an introduction to lexicographic codes. The idea is that all words in $Z_q^n$ are listed
in lexicographic order, i.e., with $x = x_1 \cdots x_n$ listed before $y = y_1 \cdots y_n$ if $x_i < y_i$, where i
is the first position in which x and y differ. Then, starting with the empty code, one
proceeds down this list and adds to the code any word whose addition does not violate any of
the combinatorial constraints. (Ordinarily these would be a Hamming distance and
possibly a Hamming weight constraint, but GC-content and reverse-complement Hamming
distance constraints can be enforced as well.) Since the resulting lexicographic codes can
accommodate no more codewords without a constraint being violated, they meet or
exceed Gilbert-type lower bounds; they often do much better. There are many variants
of the standard lexicographic construction: for example, the words may be ordered as a
Gray code, or one may start with an arbitrary codeword as a seed rather than with the
empty code jl]. We used three variants, singly and in combination, to construct DNA
codes with the desired constraints:

(i) We used different orderings of the characters A, C, G and T when putting the $4^n$
DNA words of length n in lexicographic order. There are 4! = 24 orderings of the four
characters, but because of the symmetry between A and T and between C and G, only 6 
of these 24 orderings need to be considered. 

(ii) We used offsets, as in jTH]: one starts at an arbitrary place in the list of words 
rather than at the beginning, and loops back around to the beginning of the list when the 
end is reached. 

(iii) We used a "factored" ordering of the DNA words. The $2^n$ binary words of length
n were listed in lexicographic order, $u_1 = 0 \cdots 0, \ldots, u_{2^n} = 1 \cdots 1$. As in [T7|, we define a
mapping $\odot$ from pairs of binary words of length n to DNA words of length n, given by
$x \odot y = z$, where $z_i = A$ if $x_i = 0$ and $y_i = 1$; $z_i = C$ if $x_i = 1$ and $y_i = 0$; $z_i = G$ if
$x_i = y_i = 1$; and $z_i = T$ if $x_i = y_i = 0$. Note that $\odot$ is a bijection, and that the Hamming
weight of x is equal to the GC-content of z. We ordered the $4^n$ DNA words so that $u_i \odot u_j$
comes before $u_k \odot u_m$ if i < k or if i = k and j < m.

When combining variants (ii) and (iii) above, two offsets can be used: one for the
binary words in the first slot of $x \odot y$, and another for those in the second slot.

We used the above three approaches to construct DNA codes with constant GC-content,
both with and without the reverse-complement constraint, for a variety of
parameters n, d and w. Using offsets of zero and an average of about ten random offsets,
we found codes that are larger than the codes given in El 121] for many choices of
parameters. The sizes of the lexicographic codes are given in Tables 1 and 2, and the
offsets used to generate these codes are given in Tables 3 and 4.
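The basic greedy construction, together with variants (i) and (ii), can be sketched in a few lines. This is a minimal illustration and not the implementation used to produce the tables: the function and parameter names are ours, the offset is assumed to index the full list of $4^n$ words, and every candidate is compared against all accepted codewords, so the sketch is only practical for small n.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def reverse_complement(x):
    return tuple(COMPLEMENT[c] for c in reversed(x))

def lexicode(n, d, w, use_rc=False, order="ACGT", offset=0):
    """Greedy lexicographic code with constant GC-content w and distance d.

    order  -- nucleotide ordering used for the lexicographic list (variant i)
    offset -- starting index in the list, wrapping around at the end (variant ii)
    """
    words = list(product(order, repeat=n))       # lexicographic with respect to `order`
    words = words[offset:] + words[:offset]
    code = []
    for x in words:
        if sum(c in "CG" for c in x) != w:
            continue
        if use_rc and hamming(x, reverse_complement(x)) < d:
            continue
        ok = all(hamming(x, y) >= d and
                 (not use_rc or hamming(x, reverse_complement(y)) >= d)
                 for y in code)
        if ok:
            code.append(x)
    return ["".join(x) for x in code]
```

Checking $H(x, y^{RC})$ in one direction is enough, because taking reverse-complements of both words preserves Hamming distance. A call such as `lexicode(4, 2, 2, use_rc=True, order="CGAT", offset=5)` combines an alternative nucleotide ordering with an offset.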

Product bounds 

The lexicographic constructions described above do not scale well to large n. One can
avoid the burden of explicitly computing distances between all pairs of codewords (and
also the burden of explicitly listing all codewords) by using modifications of algebraic
constructions such as linear codes. For example, a DNA code with minimum Hamming
distance at least d and constant GC-content w can be constructed by taking any linear
code over $Z_4$ (or the Galois field $F_4$ [3] or the Kleinian four-group PU|) that has minimum
Hamming distance d, and selecting only those codewords with exactly w occurrences of
two fixed characters.

In this section we give lower bounds for DNA codes that are constructed from binary
codes, binary constant-weight codes, and ternary constant-weight codes, for which
a variety of algebraic constructions are known (e.g. [TBI I3[ ITU]).

Note that the reverse-complement operator RC can be viewed as the composition of
two (commuting) operators R and C, where R maps $x_1 \cdots x_n$ to $x_n \cdots x_1$ and C replaces
each coordinate $x_i$ with its complement $\bar{x}_i$. We state the product bounds below in terms
of constraints on R rather than on RC to make the arguments cleaner. (This approach
was used in [17].) The values $A_q^{R}(n, d)$, $A_q^{R}(n, d, w)$ and $A_4^{GC,R}(n, d, w)$ are defined in the same
manner as $A_4^{RC}(n, d)$, $A_4^{RC}(n, d, w)$ and $A_4^{GC,RC}(n, d, w)$, but with the constraint that
$H(x, y^{R}) \ge d$ for all codewords x and y in place of the constraint that $H(x, y^{RC}) \ge d$. Bounds on
$A_4^{GC,R}(n, d, w)$ can be used to derive bounds for $A_4^{GC,RC}(n, d, w)$ using the following result:

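Proposition 12 For $0 \le d \le n$ and $0 \le w \le n$,

\begin{align}
A_4^{GC,RC}(n,d,w) &= A_4^{GC,R}(n,d,w) && \text{if } n \text{ is even,} \tag{16}\\
A_4^{GC,R}(n,d+1,w) \le A_4^{GC,RC}(n,d,w) &\le A_4^{GC,R}(n,d-1,w) && \text{if } n \text{ is odd.} \tag{17}
\end{align}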

Proof. The analogous result for DNA codes with unrestricted GC-content was proved
in ]T7j, and essentially the same proof works here. Given a set of codewords of length
n, if we replace all the entries in any subset of the positions by their complements, the
GC-content of each codeword is preserved, as is the Hamming distance between any pair
of codewords. The Hamming distance between a codeword and the reverse or
reverse-complement of another codeword is not in general preserved, but if n is even and we
replace the first n/2 coordinates of each codeword $x_i$ by their complements to form a new
word $y_i$, then $H(x_i, x_j^{R}) = H(y_i, y_j^{RC})$ for all codewords $x_i$ and $x_j$. Similarly, if n is odd
and we replace the first (n - 1)/2 coordinates of each codeword $x_i$ by their complements
to form $y_i$, then $|H(x_i, x_j^{R}) - H(y_i, y_j^{RC})| \le 1$.
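The half-complementing trick in this proof is easy to verify exhaustively for short words. The following sketch (function names are ours) computes the largest discrepancy between $H(x, x'^{R})$ and $H(y, y'^{RC})$ over all pairs of words of a given length, where y and y' are obtained by complementing the first $\lfloor n/2 \rfloor$ coordinates.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def reverse(x):
    return x[::-1]

def reverse_complement(x):
    return "".join(COMPLEMENT[c] for c in reversed(x))

def complement_first_half(x):
    """Complement the first floor(n/2) coordinates, leaving the rest unchanged."""
    h = len(x) // 2
    return "".join(COMPLEMENT[c] for c in x[:h]) + x[h:]

def max_discrepancy(n):
    """Largest |H(x, x2^R) - H(y, y2^RC)| over all pairs of length-n words."""
    words = ["".join(t) for t in product("ACGT", repeat=n)]
    worst = 0
    for x in words:
        y = complement_first_half(x)
        for x2 in words:
            y2 = complement_first_half(x2)
            gap = abs(hamming(x, reverse(x2)) - hamming(y, reverse_complement(y2)))
            worst = max(worst, gap)
    return worst
```

The discrepancy is 0 for even n, and for odd n it comes entirely from the middle coordinate, which is compared with itself under R but with its complement under RC.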

Theorem 13 For $0 \le d \le n$ and $0 \le w \le n$,

\begin{align}
A_4^{GC}(n,d,w) &\ge A_2(n,d,w) \cdot A_2(n,d) \tag{18}\\
A_4^{GC,R}(n,d,w) &\ge A_2^{R}(n,d,w) \cdot A_2(n,d) \tag{19}\\
A_4^{GC,R}(n,d,w) &\ge A_2(n,d,w) \cdot A_2^{R}(n,d) \tag{20}\\
A_4^{GC}(n,d,w) &\ge A_3(n,d,w) \cdot A_2(n-w,d) \tag{21}\\
A_4^{GC,R}(n,d,w) &\ge A_3^{R}(n,d,w) \cdot A_2(n-w,d) \tag{22}\\
A_4^{GC,R}(n,d,w) &\ge A_3(n,d,w) \cdot A_2^{R}(n-w,d) \tag{23}
\end{align}
Proof. For (18) and (19), note that if $B_1$ is a set of binary words with length n, Hamming
weight w and minimum Hamming distance d, and if $B_2$ is a set of binary words with
length n and minimum Hamming distance d, then $D = \{x \odot y : x \in B_1 \text{ and } y \in B_2\}$ is a
set of DNA words with length n, GC-content w and minimum Hamming distance d. If,
in addition, $H(x_1, x_2^{R}) \ge d$ for all $x_1, x_2 \in B_1$, then $H(z_1, z_2^{R}) \ge d$ for all $z_1, z_2 \in D$ as
well, since $H(x_1 \odot y_1, (x_2 \odot y_2)^{R}) = H(x_1 \odot y_1, x_2^{R} \odot y_2^{R}) \ge H(x_1, x_2^{R}) \ge d$. Inequality
(20) is proved in the same manner as (19).

For (21)-(23) we first define a function that maps a pair consisting of a ternary word
x of length n and Hamming weight w, and a binary word y of length n - w, to a DNA
word $z = x \odot y$ of length n. This map is defined by $z_i = C$ if $x_i = 1$; $z_i = G$ if $x_i = 2$;
$z_i = A$ if $x_i$ is the j-th zero-entry in x and $y_j = 0$; and $z_i = T$ if $x_i$ is the j-th zero-entry
in x and $y_j = 1$. The argument now proceeds as for (18)-(20).
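The binary product construction behind (18) can be demonstrated on a small case. In the sketch below the names are ours, the pairing of bits to nucleotides follows the definition of the map given for the "factored" ordering, and the two binary factors are deliberately simple choices for d = 2: all words of weight w, and all words of even weight.

```python
from itertools import product

# (x_i, y_i) -> nucleotide, following the definition of the pairing map.
PAIR_TO_NUCLEOTIDE = {(0, 1): "A", (1, 0): "C", (1, 1): "G", (0, 0): "T"}

def pair_words(x, y):
    """Combine two binary words of equal length into one DNA word."""
    return "".join(PAIR_TO_NUCLEOTIDE[a, b] for a, b in zip(x, y))

def product_code(b1, b2):
    """All DNA words obtained by pairing a word of b1 with a word of b2."""
    return [pair_words(x, y) for x in b1 for y in b2]

def example_code_d2(n, w):
    """Product code for d = 2: constant-weight words times even-weight words."""
    binary = list(product((0, 1), repeat=n))
    constant_weight = [x for x in binary if sum(x) == w]   # pairwise distance >= 2
    even_weight = [y for y in binary if sum(y) % 2 == 0]   # pairwise distance >= 2
    return product_code(constant_weight, even_weight)
```

For n = 4 and w = 2 this yields 6 x 8 = 48 DNA words with GC-content 2 and minimum distance 2, since two distinct codewords differ either in their first factor or in their second, and each factor has minimum distance 2.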

Remark 14 Lower bounds for $A_2(n, d, w)$ can be found in jl], lower bounds for $A_2(n, d)$
in [HE], and lower bounds for $A_3(n, d, w)$ in JH]. The bounds on ternary constant-weight
codes in JH] also apply directly to DNA codes with constant C-content over the
three-letter alphabet {A, C, T}. This restricted alphabet is used by some researchers to reduce
the probability of individual codewords having "secondary structure" such as hairpin loops
[THIIH]; note also that if x and y are DNA words over {A, C, T} with C-content at least
d, the reverse-complement Hamming distance constraint $H(x, y^{RC}) \ge d$ is automatically
satisfied.

Remark 15 Inequalities (18)-(20) are analogues of the product bounds for DNA codes
with unrestricted GC-content in J7|; (18) is also a generalization of the "template-map"
construction used in ^3] for codes with constant GC-content. In that construction, a
constant-weight binary code acts as the "template" (corresponding to the first factor in
(18)) and the same constant-weight binary code, with at most two words of other weights
added in, acts as the "map" (corresponding to the second factor in (18)). This gives a
DNA code of size no larger than $A_2(n, d, w) \cdot A_2(n, d)$, and when $A_2(n, d, w) + 2 < A_2(n, d)$
this gives a strictly smaller code (e.g., $A_2(n, 2, w) = \binom{n}{w}$, which can be much less than
$A_2(n, 2) = 2^{n-1}$). But for the parameters $w = d \approx n/2$ considered in ^3], this difference
can be inconsequential; in particular, $A_2(n, n/2, n/2) = A_2(n, n/2) - 2 = 2n - 2$ whenever
a Hadamard matrix of order n exists [21], i.e. for all n divisible by 4 up to at least
n = 424. Note that even when optimal binary codes are used as factors, the lower bounds
derived from product codes are not in general tight; for instance,
$A_2(12, 6, 6) \cdot A_2(12, 6) = 22 \cdot 24 = 528$, while we constructed a lexicographic code showing
that $A_4^{GC}(12, 6, 6) \ge 736$.
In fact, product codes do not even meet the Gilbert-type lower bound for $A_4^{GC}(2w, w, w)$
when w is sufficiently large: replacing the denominator in (12) with the upper bound
$w \binom{2w}{w-1} 3^{w-1}$ for the number of words with Hamming distance at most w - 1 from a fixed
codeword gives $A_4^{GC}(2w, w, w) > 3(4/3)^w (w+1)/w^2$; the product-code construction gives
a code of size at most $A_2(2w, w, w) \cdot A_2(2w, w) \le (4w - 2) 4w$. (The "template-code"
construction used in [TJ is similar to the template-map construction discussed above,
but with an additional constraint to prevent codewords from hybridizing to concatenations
of other codewords.)
Below we show that product codes can be optimal when d = 2:

Theorem 16 For $0 \le w \le n$,

\[ A_4^{GC}(n,2,w) = \binom{n}{w} 2^{n-1}. \tag{24} \]

Proof. In one direction we have $A_4^{GC}(n, 2, w) \ge A_2(n, 2, w) \cdot A_2(n, 2)$ by (18). Note that
$A_2(n, 2, w) = \binom{n}{w}$ since the Hamming distance between two distinct binary words of the
same weight is at least two; note also that $A_2(n, 2) = 2^{n-1}$, since the first n - 1 coordinates
can be arbitrary with the last coordinate used as a parity check bit (see e.g. |2Uj).

In the other direction, $A_4^{GC}(w, 2, w) = A_4^{GC}(w, 2, 0) = A_2(w, 2) = 2^{w-1} = \binom{w}{w} 2^{w-1}$,
and if $A_4^{GC}(n, 2, w) \le \binom{n}{w} 2^{n-1}$ for some $n \ge w$ then by (8) we have
$A_4^{GC}(n+1, 2, w) \le \frac{2(n+1)}{n+1-w} \binom{n}{w} 2^{n-1} = \binom{n+1}{w} 2^{n}$. Hence by induction
$A_4^{GC}(n, 2, w) \le \binom{n}{w} 2^{n-1}$ for all $n \ge w$.



Theorem 17 For $0 \le w \le n$ and n even,

\[ A_4^{GC,RC}(n,2,w) = \binom{n}{w} 2^{n-2}. \tag{25} \]

Proof. By (11), $A_4^{GC,RC}(n, 2, w) \le \frac{1}{2} A_4^{GC}(n, 2, w) = \frac{1}{2} \binom{n}{w} 2^{n-1} = \binom{n}{w} 2^{n-2}$. For n even,
$A_4^{GC,RC}(n, 2, w) = A_4^{GC,R}(n, 2, w)$ by (16), and $A_2^{R}(n, 2) = 2^{n-2}$ by Theorem 4.5 of |Hj.
Thus by the product bound (20), $A_4^{GC,R}(n, 2, w) \ge A_2(n, 2, w) \cdot A_2^{R}(n, 2) = \binom{n}{w} 2^{n-2}$. (Here is
an alternate argument showing $A_2^{R}(n, 2) = 2^{n-2}$ for n even: when n is even, the set of all
$2^{n-1}$ binary words of odd Hamming weight contains no palindromes, and the reverse of a
binary word of odd weight has odd weight, so these $2^{n-1}$ words break up into $2^{n-2}$ pairs
$\{x, x^{R}\}$; taking one word from each pair shows that $A_2^{R}(n, 2) \ge 2^{n-2}$, since the Hamming
distance between two distinct binary words of odd weight is at least two; equality follows
from a halving bound, $A_2^{R}(n, 2) \le \frac{1}{2} A_2(n, 2) = 2^{n-2}$ [TZJ.)
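The closed forms in Theorems 16 and 17 are simple to evaluate, and doing so for w = n/2 reproduces the d = 2 columns of Tables 1 and 2. A small sketch, with function names of our own:

```python
from math import comb

def a_gc_d2(n, w):
    """Theorem 16: exact value of A_4^GC(n, 2, w)."""
    return comb(n, w) * 2 ** (n - 1)

def a_gc_rc_d2(n, w):
    """Theorem 17: exact value of A_4^GC,RC(n, 2, w); stated for even n only."""
    if n % 2:
        raise ValueError("Theorem 17 is stated for even n")
    return comb(n, w) * 2 ** (n - 2)

for n in (4, 6, 8, 10, 12):
    print(n, a_gc_rc_d2(n, n // 2), a_gc_d2(n, n // 2))
```

For n = 12 this gives 924 x 1024 = 946176 with the reverse-complement constraint and 924 x 2048 = 1892352 without it.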



Tables 

Lower bounds for $A_4^{GC,RC}(n, d, w)$, derived from codes constructed using stochastic local
search, are given in [2H] and [21] for $n \le 12$ (n even) with $d \le n$ and w = n/2. In
Tables 1 and 2 we give lower bounds for $A_4^{GC,RC}(n, d, w)$ and $A_4^{GC}(n, d, w)$ derived from
lexicographic constructions for these same parameters. Our bounds are at least as large
as those in ESI 121] for all parameters except the five cases marked with asterisks; those
that are strictly larger (or for which no bounds were given) are underlined. (Our bound on
$A_4^{GC}(n, d, w)$ is not underlined if it is equal to twice the bound on $A_4^{GC,RC}(n, d, w)$ given
in [TH 1231 123, since the former bound is then implied by the latter using the halving
bound.) Entries followed by periods are optimal, as the lower bounds are equal to the
upper bounds computed using Theorems 2, 4 and 7 (the Johnson-type bounds and the
halving bound).

Guide to superscripts in Tables 1 and 2: 

a. Not explicitly constructed lexicographically; value from Theorem 17.

b. Not explicitly constructed lexicographically; value from Theorem 16.

*. Larger code constructed using stochastic local search in [21] (size given in superscript). 

Table 1. Lower bounds for $A_4^{GC,RC}(n, d, w)$ with $n \le 12$ (n even), $d \le n$ and w = n/2.

n\d          2          3     4        5        6    7    8    9   10   11   12
4          24.         6.    2.
6         320.   39^{*41}    16       4.       2.
8        4480. 384^{*390}   112 25^{*26} 10^{*12}   2.   2.
10      64512.       4084   795      166       46   15    6   2.   2.
12   946176.^a      49764  8704     1362      306   81   27   10   4.   2.   2.


Table 2. Lower bounds for $A_4^{GC}(n, d, w)$ with $n \le 12$ (n even), $d \le n$ and w = n/2.

n\d           2       3      4     5         6    7    8    9   10   11   12
4           48.     12.     4.
6          640.      96    40.     8        4.
8         8960.     832    224    56  2q^{*24}   5.   4.
10      129024.    9344   1676   360        96   32  16.   5.   4.
12   1892352.^b  112640  17408  2992       736  177   68   22    8   4.   4.


Remark 18 In [23 and [Tl], lower bounds for $A_4^{GC}(n, d, w)$ are given for $4 \le n \le 20$ (n
odd or even) with $w = d = \lceil n/2 \rceil$. Though not covered in Table 2, we also improved upon
these bounds for n = 5, 7, 9, 11 and 13-20 using lexicographic constructions.

Tables 3 and 4 record the nucleotide-orderings and the offsets used in constructing the
lexicographic codes whose sizes are given in Tables 1 and 2. Entries are written either
in the form offset_1 ⊙ offset_2 for the "factored" lexicographic variant, or as offset^k, with
the superscript $k \in \{1, 2, 3, 4, 5, 6\}$ indicating the nucleotide-ordering used, as follows:
1 = (A < C < G < T); 2 = (C < G < A < T); 3 = (A < T < C < G); 4 = (C < A <
T < G); 5 = (C < A < G < T); 6 = (A < C < T < G). Note that we list all offsets in
base-16 rather than base-4 or base-2 for compactness, and that the offset need not itself
be a codeword since it may not satisfy the GC-content constraint or it may be too close to
its own reverse-complement. (We re-used seeds for our random-number generator, which
is why some of the same "random" offsets appear for more than one entry.)
Table 3. Offsets used to generate lexicographic codes giving lower bounds in Table 1.

n\d         2         3         4         5         6         7         8         9        10        11        12
4        59^1       59^       0^1
6         0^1     42d^4     12⊙19     bfc^2       0^1
8      5021^1    44dd^2     4e⊙95    d3de^5    90a5^5       0^1       0^1
10        0^1         5   bfcgg^1         5       0^1   codge^1   c54c6^2       0^1       0^1
12                  0⊙0       0⊙0         2  4121c8^4       0^5       0^2  96C697^1  96C697^1         1       0^1


Table 4. Offsets used to generate lexicographic codes giving lower bounds in Table 2.

n\d         2         3         4         5         6         7         8         9        10        11        12
4         0^1       0^1       0^1
6         0^1       0^2     434^1       0^1       0^1
8         0^1    5021^2       0⊙0     2d⊙23    9016^1       0^1       0^1
10        0^1       0⊙0         2       0⊙0       0⊙0       0⊙0   c8e60^5   3792d^2       0^1
12                  0⊙0       0⊙0       0⊙0       0⊙0  C86605^1   994⊙70b       0⊙0       0^2       0^1       0^1